In a well-defined study, exploratory analysis may be largely unnecessary. Consider a scenario where the client has identified questions / hypotheses of interest and the additional variables that may predict or mediate the identified outcomes. An activity partner such as MSI was provided the time and resources to consult with stakeholders, scope out the target environment to identify and map the data generating process, and develop an inferential design by which the collected data and analytical routines would address the questions posed by the client. In this scenario, exploratory analysis may be entirely eliminated, and the time that may be spent on exploration is shifted to sensitivity analysis of the pre-identified analytical routines.
On the other hand, consider a scenario where data has been collected for one purpose, and later realized to have possible use with other, not yet identified, purposes. In this case, the initial analysis is entirely exploratory.
Data visualization
Data visualization tactics
geomtextpath
There are common situations where MSI is tasked with ongoing data collection and evaluation of a client activity. For example, the MENA MELS activity (2020-2024) was tasked with ongoing monitoring and evaluation of the Middle East Partnership for Peace Activity (MEPPA). MEPPA comprised grants to several local partners organized around the common motif of building bonds between different demographic groups. The primary outcome of interest was whether or not a participant in the grant activity reported a perceived increase in understanding the political, social, and economic situation and viewpoints of another group. Data were collected on a rolling basis across several implementing partners, across baseline and endline. That data are summarized as follows:
For purposes of client reporting, there are three ordinal responses across baseline and endline, for each of three types of viewpoint. There is insufficient data to conduct inferential tests at this level of granularity, but this data may be visualized descriptively.
The geomtextpath package offers the functionality to directly label line-based plots with text that is able to follow a curved path. Simply replace ‘geom_line’ with ‘geom_textpath’ and assign the variable to use as the label. The following figures illustrate.
Given that the three types of understanding of others’ situation are highly correlated, it makes sense to present these measures compactly as aspects of a deeper underlying construct. The patchwork library allows multiple ggplots to be assembled together as a single plot. The following figure illustrates.
pol + soc + ec +plot_annotation(title="How well do you understand the situation of others?")
A final use of presenting the data more compactly is to collapse the ordinal responses to binary, and collect the three measures as lines in a single plot. The following data captures each type of understanding as either fair or high understanding as one category, and basic understanding as the other category.
dat <-read_csv("data/short demo series/meppa item ladder.csv",show_col_types=F)dat_flx <- dat %>%flextable() %>%autofit() dat_flx
endline
n
perc
item
0
55
0.714
Political situation
1
86
0.827
Political situation
0
47
0.610
Social situation
1
87
0.853
Social situation
0
45
0.584
Economic situation
1
82
0.804
Economic situation
With this simplified data summary, the trendline for each type of understanding may now be collected in a single plot.
Note further the use of secondary axis breaks to illustrate the change score for each trendline from baseline to endline.
The R computing language allows for several ways to customize the use of labels in statistical or descriptive graphics. This short demo has illustrated MSI’s use of the geomtextpath package to place labels directly along the line or curve of a plot. This illustration used only straight lines between two points in time. For additional use cases of the geomtextpath package, see the package vignette.
Mapping
Much of our data is collected from surveys and has geographic coordinates associated with a point of collection, a house, or a city, etc. It can be useful to generate a map to get a sense of where the data is coming from. Mapping is a part of MSI’s exploratory data analysis process and is also used in developing a sampling plan and in producing high quality data vizualizations for clients. This section provides a brief introduction into mapping in R.
The most commonly used packages to handle spatial data are sf for vectors, terra for vectors and rasters, and raster for rasters. To visualize the data, two frequently used packages are tmap and ggplot2 packages.
To get started, we need to load our packages.
#run this line first if you have never used these packages before#install.packages(c("tidyverse", "sf", "tmap", "readr", "here"))library(tidyverse) #install the core tidyverse packages including ggplot2library(sf) #provides tools to work with vector data library(tmap) #for visualizing spatial datalibrary(readr) #functions for reading external datasets library(here) #to better locate files not in working directorylibrary(geodata) #to download administrative boundaries
Read in the data
#It is a csv file so I use the read_csv function and provide the file pathcities <-read_csv(here::here("../methods corner/Map demo/data/Madagascar_Cities.csv") , show_col_types =FALSE)#Observe the first few rows of dataDT::datatable(head(cities))
Now, for the administrative boundaries. Each of these are being read in using gadm() from the geodata package and then converted to an sf object with the st_as_sf() function. In the example, we download only the country boundary, but if we wanted regions or departments, we would simply change the level argument inside the gadm() function call to a 1 or a 2.
Country boundary
#This is only the country boundary. mdg <- geodata::gadm(country ="MDG" , level =0 , path =tempdir()) |>st_as_sf()
Convert the cities to an sf object
Remember that the cities object is a standard .csv with longitude and latitude columns, but it is not yet recognized as an sf object that has geographic properties. Here is how to convert it to an sf object with a single geometry column and a crs.
cities_sf <- cities |>st_as_sf(coords =c("Longitude", "Latitude") , crs =4326)#observe the first few rows of dataDT::datatable(head(cities_sf))
Make the map
The following code chunks and tabs walk through the process of making and improving a map in both tmap and ggplot2. In the example, cities are what we are plotting, but we could be plotting any variable of a dataset.
#the city names are long so we have to # make a bigger window to fit them. This isn't part of the normal process#make an object with the current bounding boxbbox_new <-st_bbox(mdg)#calculate the x and y ranges of the bboxxrange <- bbox_new$xmax - bbox_new$xmin # range of x valuesyrange <- bbox_new$ymax - bbox_new$ymin # range of y values#provide the new values for the 4 corners of the bbox bbox_new[1] <- bbox_new[1] - (0.7* xrange) # xmin - left bbox_new[3] <- bbox_new[3] + (0.75* xrange) # xmax - right bbox_new[2] <- bbox_new[2] - (0.1* yrange) # ymin - bottom bbox_new[4] <- bbox_new[4] + (0.1* yrange) # ymax - top#convert the bbox to a sf collection (sfc)bbox_new <- bbox_new |># take the bounding box ...st_as_sfc() # ... and make it a sf polygon#now plot the maptmap_mode("plot") +tm_shape(mdg, bbox = bbox_new) +tm_polygons() +tm_shape(cities_sf) +tm_dots(size = .25, col ="red") +tm_text(text ="Name", auto.placement = T) +tm_layout(title ="Main Cities of\nMadagascar")
#the city names are long so we have to # make a bigger window to fit them. This isn't part of the normal process#make an object with the current bounding boxbbox_new <-st_bbox(mdg)#calculate the x and y ranges of the bboxxrange <- bbox_new$xmax - bbox_new$xmin # range of x valuesyrange <- bbox_new$ymax - bbox_new$ymin # range of y values#provide the new values for the 4 corners of the bbox bbox_new[1] <- bbox_new[1] - (0.5* xrange) # xmin - left bbox_new[3] <- bbox_new[3] + (0.5* xrange) # xmax - right bbox_new[2] <- bbox_new[2] - (0.1* yrange) # ymin - bottom bbox_new[4] <- bbox_new[4] + (0.1* yrange) # ymax - top#convert the bbox to a sf collection (sfc)bbox_new <- bbox_new |># take the bounding boxst_as_sfc() # ... and make it a sf polygonggplot2::ggplot() +geom_sf(data = mdg) +geom_sf(data = cities_sf, color ="red") + ggrepel::geom_text_repel(data = cities_sf , aes(label = Name , geometry = geometry) , stat ="sf_coordinates" , min.segment.length =0) +coord_sf(xlim =st_coordinates(bbox_new)[c(1,2),1], # min & max of x valuesylim =st_coordinates(bbox_new)[c(2,3),2]) +# min & max of y values +labs(title ="Main Cities of\nMadagascar") +theme_void()
Final touches
Now that we have a map with cities plotted (we achieved our goal!), we will add a few finishing touches and set the size of the city points to the population variable in the original dataset.
Additionally, tmap provides a simple interface to go from a static map to an interative map simply by changing tmap_mode("plot") to tmap_mode("view").
#the city names are long so we have to # make a bigger window to fit them. This isn't part of the normal process#make an object with the current bounding boxbbox_new <-st_bbox(mdg)#calculate the x and y ranges of the bboxxrange <- bbox_new$xmax - bbox_new$xmin # range of x valuesyrange <- bbox_new$ymax - bbox_new$ymin # range of y values#provide the new values for the 4 corners of the bbox bbox_new[1] <- bbox_new[1] - (0.7* xrange) # xmin - left bbox_new[3] <- bbox_new[3] + (0.75* xrange) # xmax - right bbox_new[2] <- bbox_new[2] - (0.1* yrange) # ymin - bottom bbox_new[4] <- bbox_new[4] + (0.1* yrange) # ymax - top#convert the bbox to a sf collection (sfc)bbox_new <- bbox_new |># take the bounding box ...st_as_sfc() # ... and make it a sf polygontmap_mode("plot") +tm_shape(mdg, bbox = bbox_new) +tm_polygons() +tm_shape(cities_sf) +tm_dots(size ="Population", col ="red" , legend.size.is.portrait =TRUE) +tm_text(text ="Name", auto.placement = T , along.lines = T) +tm_scale_bar(position =c("left", "bottom"), width =0.15) +tm_compass(type ="4star" , position =c("right", "bottom") , size =2) +tm_layout(main.title ="Main Cities of Madagascar" , legend.outside =TRUE)
ggplot2::ggplot() +geom_sf(data = mdg) +geom_sf(data = cities_sf, aes(size = Population) , color ="red") + ggrepel::geom_text_repel(data = cities_sf , aes(label = Name , geometry = geometry) , stat ="sf_coordinates" , min.segment.length =0) +coord_sf(xlim =st_coordinates(bbox_new)[c(1,2),1], # min & max of x valuesylim =st_coordinates(bbox_new)[c(2,3),2]) +# min & max of y values + ggspatial::annotation_scale(location ="bl") + ggspatial::annotation_north_arrow(location ="br" , which_north ="true" , size =1)+labs(title ="Main Cities of Madagascar") +theme_void()
For those interested in mapping in R (or QGIS) there are many free resources available online. A great starting point for R is the online text book, Geocomputation with R. If you would rather learn more in Python, Geocomputation with Python is a great resource.